Introduction

Data comes in many different shapes: some datasets have their observations spread across columns, some across rows, and most contain gaps.

In this lab, we will particularly be interested in time series, which differ from many other data types and call for their own way of analysis. In this tutorial you will learn how to:

  • preprocess the data (clean NAs, change the data structure etc.)
  • play with date objects
  • plot time series
  • analyze trend and seasonality

Getting Started

We will visualize the Covid Confirmed Cases dataset by Johns Hopkins University and the Maddison Project GDP Per Capita dataset. To be able to follow, you need:

  1. to download the covid_confirmed.csv and maddison.csv files uploaded on LEARN
  2. some packages, including
  • tidyverse
  • lubridate
  • gganimate (takes effort to install)
  • plotly (takes time to install)
##   X Province        Country            Date Confirmed Deaths Recovered
## 1 1          Mainland China 1/22/2020 17:00         1     NA        NA
## 2 2          Mainland China 1/22/2020 17:00        14     NA        NA
## 3 3          Mainland China 1/22/2020 17:00         6     NA        NA
## 4 4          Mainland China 1/22/2020 17:00         1     NA        NA
## 5 5          Mainland China 1/22/2020 17:00        NA     NA        NA
## 6 6          Mainland China 1/22/2020 17:00        26     NA        NA
##             X Province            Country                Date Confirmed Deaths
## 277131 277131                     Vietnam 2020-06-13 03:33:14       333      0
## 277132 277132          West Bank and Gaza 2020-06-13 03:33:14       489      3
## 277133 277133              Western Sahara 2020-06-13 03:33:14         9      1
## 277134 277134                       Yemen 2020-06-13 03:33:14       632    139
## 277135 277135                      Zambia 2020-06-13 03:33:14      1321     10
## 277136 277136                    Zimbabwe 2020-06-13 03:33:14       343      4
##        Recovered
## 277131       323
## 277132       414
## 277133         6
## 277134        28
## 277135      1104
## 277136        51

Preprocessing Data

Datetime in R

Dates and datetimes are special data types that are processed and used in their own fashion. Most data has them in some particular format, but that format is not universal. We can write the birthday of Guthrie Govan in many ways: 12/27/1971, 12/27/71, 27.12.1971, 27/12/1971 and so on. Luckily there is an ISO format for dates, in the form 1971-12-27, and this is R’s default format.

If you read a csv file, date columns are by default read as plain text (character, or factor in older R versions). You can convert them as below:
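The conversion chunk is not shown above; a minimal sketch (using a hypothetical bday variable) could be:

```r
bday <- "1971-12-27"   # as read from a csv: plain text
class(bday)            # "character"
bday <- as.Date(bday)  # ISO format needs no extra arguments
class(bday)            # "Date"
bday
```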

## [1] "character"
## [1] "Date"
## [1] "1971-12-27"

If the date is not in YYYY-MM-DD format, you need to specify the input format:
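A sketch with an explicit format string:

```r
bday <- as.Date("12/27/1971", format = "%m/%d/%Y")
class(bday)   # "Date"
bday          # "1971-12-27"
```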

## [1] "Date"
## [1] "1971-12-27"

You can notice that the format is specified in its own mini-language:

Code  Value                Example
%d    day                  27
%m    month                12
%b    month (string)       Dec
%B    month (long str.)    December
%a    weekday (string)     Mon
%A    weekday (long str.)  Monday
%y    year (2 digits)      71
%Y    year (4 digits)      1971
%H    hour                 21
%M    minute               30
%S    second               0

So if you have dates in other formats, you can convert them easily:
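For example:

```r
as.Date("27.12.1971", format = "%d.%m.%Y")
```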

## [1] "1971-12-27"

If you receive NA, it means the string could not be parsed with the given format:
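For instance, the format below does not match the string:

```r
as.Date("27.12.1971", format = "%m/%d/%Y")  # mismatched format -> NA
```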

## [1] NA

After a successful conversion, you can easily carry out mathematical operations:
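A sketch of such operations; the first two results depend on the day you run the code (they were produced in 2020), while the last is a difference between two fixed dates:

```r
bday  <- as.Date("1971-12-27")
bday2 <- as.Date("1970-12-24")
(Sys.Date() - bday) / 365.25     # rough age in years (depends on today's date)
(Sys.Date() - bday2) / 365.25
as.Date("1972-12-29") - bday     # fixed difference: 368 days
```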

## Time difference of 48.52329 days
## Time difference of 49.53151 days
## Time difference of 368 days

If you have datetime data then you can convert it using strptime:
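A sketch of the strptime call:

```r
dt <- strptime("27.12.1971 21:30", format = "%d.%m.%Y %H:%M", tz = "GMT")
dt   # "1971-12-27 21:30:00 GMT"
```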

## [1] "1971-12-27 21:30:00 GMT"

You can also give any format you want and print:
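A sketch of printing with a custom format (weekday and month names follow your locale; the output below assumes an English locale):

```r
dt <- strptime("27.12.1971 21:30", format = "%d.%m.%Y %H:%M", tz = "GMT")
format(dt, "%A, %B %d, %Y %H:%M:%S")
```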

## [1] "Monday, December 27, 1971 21:30:00"

Lubridate Package

It is not easy to extract such information using only the methods above. What you need is lubridate:
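A sketch with lubridate’s extractor functions (weekdays() is base R; mixing it into c() coerces everything to character, as in the last output below):

```r
library(lubridate)
bday <- as.Date("1971-12-27")
c(day(bday), month(bday), year(bday))      # 27 12 1971
c(year(bday), quarter(bday))               # 1971 4
c(week(bday), wday(bday), weekdays(bday))  # week of year, day of week, name
```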

## [1]   27   12 1971
## [1] 1971    4
## [1] "52"     "2"      "Monday"

It also has a separate function for datetimes, as_datetime, which does the same thing as strptime:

## [1] "1971-12-27 21:30:00 GMT"

but it also has some handy functions that require less effort and are very generic:
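These parsers only need to know the order of the components, not the exact separators; a sketch matching the outputs below:

```r
library(lubridate)
dmy_hms("27.12.1971 21:30:00")
mdy_hms("12/27/1971 21:30:00")
ymd_hms("1971-12-27 21:30:00")
dmy_hm("27-12-1971 21:30")
dmy("27.12.1971")   # date only
```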

## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27"

Earthquake Data

Now we know how to process the data:
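A sketch of the processing below; "earthquake.csv" is a hypothetical filename for the earthquake dataset shown here:

```r
library(lubridate)
quakes_raw <- read.csv("earthquake.csv", stringsAsFactors = FALSE)
head(quakes_raw)

datetimes <- paste(quakes_raw$Date, quakes_raw$Time)  # combine the two columns
head(datetimes)
class(quakes_raw$Date)

quake_dt <- mdy_hms(datetimes)  # the "failed to parse" warnings come from a
head(quake_dt)                  # few entries in a different format
```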

##         Date     Time Latitude Longitude       Type Depth Depth.Error
## 1 01/02/1965 13:44:18   19.246   145.616 Earthquake 131.6          NA
## 2 01/04/1965 11:29:49    1.863   127.352 Earthquake  80.0          NA
## 3 01/05/1965 18:05:58  -20.579  -173.972 Earthquake  20.0          NA
## 4 01/08/1965 18:49:43  -59.076   -23.557 Earthquake  15.0          NA
## 5 01/09/1965 13:32:50   11.938   126.427 Earthquake  15.0          NA
## 6 01/10/1965 13:36:32  -13.405   166.629 Earthquake  35.0          NA
##   Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1                     NA       6.0             MW              NA
## 2                     NA       5.8             MW              NA
## 3                     NA       6.2             MW              NA
## 4                     NA       5.8             MW              NA
## 5                     NA       5.8             MW              NA
## 6                     NA       6.7             MW              NA
##   Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1                         NA            NA                  NA               NA
## 2                         NA            NA                  NA               NA
## 3                         NA            NA                  NA               NA
## 4                         NA            NA                  NA               NA
## 5                         NA            NA                  NA               NA
## 6                         NA            NA                  NA               NA
##   Root.Mean.Square           ID Source Location.Source Magnitude.Source
## 1               NA ISCGEM860706 ISCGEM          ISCGEM           ISCGEM
## 2               NA ISCGEM860737 ISCGEM          ISCGEM           ISCGEM
## 3               NA ISCGEM860762 ISCGEM          ISCGEM           ISCGEM
## 4               NA ISCGEM860856 ISCGEM          ISCGEM           ISCGEM
## 5               NA ISCGEM860890 ISCGEM          ISCGEM           ISCGEM
## 6               NA ISCGEM860922 ISCGEM          ISCGEM           ISCGEM
##      Status
## 1 Automatic
## 2 Automatic
## 3 Automatic
## 4 Automatic
## 5 Automatic
## 6 Automatic
## [1] "01/02/1965 13:44:18" "01/04/1965 11:29:49" "01/05/1965 18:05:58"
## [4] "01/08/1965 18:49:43" "01/09/1965 13:32:50" "01/10/1965 13:36:32"
## [1] "character"
## Warning: 3 failed to parse.
## Warning: 3 failed to parse.
## [1] "1965-01-02 13:44:18 UTC" "1965-01-04 11:29:49 UTC"
## [3] "1965-01-05 18:05:58 UTC" "1965-01-08 18:49:43 UTC"
## [5] "1965-01-09 13:32:50 UTC" "1965-01-10 13:36:32 UTC"

Basic operations

In R, there are some operations that you must know by heart: selecting rows and columns, slicing the data and so on. “By heart” here means you must know how they work.

Assume you have the following data:

##              Name      Instrument   Birthday
## 1   Guthrie Govan          guitar 27.12.1971
## 2 Marco Minnemann           drums 24.12.1970
## 3      Tony Levin            bass   6.6.1946
## 4   Steven Wilson guitar/keyboard  3.11.1967
## 5   Jordan Rudess        keyboard  4.11.1956

We can convert the Birthday column to dates and calculate ages as we have shown above:

##              Name      Instrument   Birthday      Age
## 1   Guthrie Govan          guitar 1971-12-27 48.52329
## 2 Marco Minnemann           drums 1970-12-24 49.53151
## 3      Tony Levin            bass 1946-06-06 74.09863
## 4   Steven Wilson guitar/keyboard 1967-11-03 52.67397
## 5   Jordan Rudess        keyboard 1956-11-04 63.67671

Selecting Columns

Since this is a data frame, you can use musicians$Birthday to choose the third column, but this won’t work for matrices. There is another notation you need to know: you can select certain rows and columns using the [ , ] notation:
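A sketch producing the selections below, with the musicians data rebuilt for self-containment (in older R versions the strings become factors, as the first output shows):

```r
musicians <- data.frame(
  Name = c("Guthrie Govan", "Marco Minnemann", "Tony Levin",
           "Steven Wilson", "Jordan Rudess"),
  Instrument = c("guitar", "drums", "bass", "guitar/keyboard", "keyboard"),
  Birthday = as.Date(c("1971-12-27", "1970-12-24", "1946-06-06",
                       "1967-11-03", "1956-11-04"))
)
musicians$Age <- as.numeric(Sys.Date() - musicians$Birthday) / 365.25

musicians[, 1]                  # a single column by position
musicians[, 1:3]                # a range of columns
musicians[, c("Name", "Age")]   # columns by name
```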

## [1] Guthrie Govan   Marco Minnemann Tony Levin      Steven Wilson  
## [5] Jordan Rudess  
## 5 Levels: Guthrie Govan Jordan Rudess Marco Minnemann ... Tony Levin
##              Name      Instrument   Birthday
## 1   Guthrie Govan          guitar 1971-12-27
## 2 Marco Minnemann           drums 1970-12-24
## 3      Tony Levin            bass 1946-06-06
## 4   Steven Wilson guitar/keyboard 1967-11-03
## 5   Jordan Rudess        keyboard 1956-11-04
##              Name      Age
## 1   Guthrie Govan 48.52329
## 2 Marco Minnemann 49.53151
## 3      Tony Levin 74.09863
## 4   Steven Wilson 52.67397
## 5   Jordan Rudess 63.67671

or we can drop columns 2 and 3 by using the ‘-’ sign:

##              Name      Age
## 1   Guthrie Govan 48.52329
## 2 Marco Minnemann 49.53151
## 3      Tony Levin 74.09863
## 4   Steven Wilson 52.67397
## 5   Jordan Rudess 63.67671
##              Name      Age
## 1   Guthrie Govan 48.52329
## 2 Marco Minnemann 49.53151
## 3      Tony Levin 74.09863
## 4   Steven Wilson 52.67397
## 5   Jordan Rudess 63.67671

Tidyverse way

There is a special function defined in tidyverse, select for selecting columns:

##              Name   Birthday
## 1   Guthrie Govan 1971-12-27
## 2 Marco Minnemann 1970-12-24
## 3      Tony Levin 1946-06-06
## 4   Steven Wilson 1967-11-03
## 5   Jordan Rudess 1956-11-04

It also allows choosing a range of columns using their names:

##              Name      Instrument   Birthday
## 1   Guthrie Govan          guitar 1971-12-27
## 2 Marco Minnemann           drums 1970-12-24
## 3      Tony Levin            bass 1946-06-06
## 4   Steven Wilson guitar/keyboard 1967-11-03
## 5   Jordan Rudess        keyboard 1956-11-04
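The two select calls above could look like this (a sketch, with the musicians data assumed as constructed earlier):

```r
library(dplyr)
select(musicians, Name, Birthday)   # individual columns
select(musicians, Name:Birthday)    # a range of columns by name
```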

Selecting Rows

##              Name Instrument   Birthday      Age
## 1   Guthrie Govan     guitar 1971-12-27 48.52329
## 2 Marco Minnemann      drums 1970-12-24 49.53151
##              Name      Instrument   Birthday      Age
## 1   Guthrie Govan          guitar 1971-12-27 48.52329
## 2 Marco Minnemann           drums 1970-12-24 49.53151
## 4   Steven Wilson guitar/keyboard 1967-11-03 52.67397

or we can drop the rows 3 and 4 by using ‘-’ sign:

##              Name Instrument   Birthday      Age
## 1   Guthrie Govan     guitar 1971-12-27 48.52329
## 2 Marco Minnemann      drums 1970-12-24 49.53151
## 5   Jordan Rudess   keyboard 1956-11-04 63.67671

We can also use logical arrays as filters:

##              Name      Instrument   Birthday      Age
## 1   Guthrie Govan          guitar 1971-12-27 48.52329
## 2 Marco Minnemann           drums 1970-12-24 49.53151
## 4   Steven Wilson guitar/keyboard 1967-11-03 52.67397

The musicians younger than age 50:

## [1]  TRUE  TRUE FALSE FALSE FALSE
##              Name Instrument   Birthday      Age
## 1   Guthrie Govan     guitar 1971-12-27 48.52329
## 2 Marco Minnemann      drums 1970-12-24 49.53151

Guitarists:

## [1]  TRUE FALSE FALSE FALSE FALSE
##            Name Instrument   Birthday      Age
## 1 Guthrie Govan     guitar 1971-12-27 48.52329

This doesn’t look right: the exact match misses Steven Wilson, whose instrument is guitar/keyboard. Another function, grepl, is very handy for text filtering. It returns a logical vector telling whether each element of a vector contains a particular phrase. For example, the musicians playing guitar can be retrieved as:
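A sketch of the grepl filter (musicians rebuilt for self-containment):

```r
musicians <- data.frame(
  Name = c("Guthrie Govan", "Marco Minnemann", "Tony Levin",
           "Steven Wilson", "Jordan Rudess"),
  Instrument = c("guitar", "drums", "bass", "guitar/keyboard", "keyboard"),
  stringsAsFactors = FALSE
)
grepl("guitar", musicians$Instrument)              # TRUE FALSE FALSE TRUE FALSE
musicians[grepl("guitar", musicians$Instrument), ] # rows containing "guitar"
```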

## [1]  TRUE FALSE FALSE  TRUE FALSE
##            Name      Instrument   Birthday      Age
## 1 Guthrie Govan          guitar 1971-12-27 48.52329
## 4 Steven Wilson guitar/keyboard 1967-11-03 52.67397

Tidyverse way

In the tidyverse package there is a special function for this, filter, which does the same thing as subset does. So you may ask why. Don’t ask me.
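The three filters used in this subsection might look like this (a sketch; the age cutoff reflects ages as of 2020, when the outputs were produced):

```r
library(dplyr)
filter(musicians, Age < 50)
filter(musicians, Instrument == "guitar")
filter(musicians, grepl("guitar", Instrument))
```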

##              Name Instrument   Birthday      Age
## 1   Guthrie Govan     guitar 1971-12-27 48.52329
## 2 Marco Minnemann      drums 1970-12-24 49.53151

and the guitarist extraordinaire:

##            Name Instrument   Birthday      Age
## 1 Guthrie Govan     guitar 1971-12-27 48.52329

and the guitarists:

##            Name      Instrument   Birthday      Age
## 1 Guthrie Govan          guitar 1971-12-27 48.52329
## 2 Steven Wilson guitar/keyboard 1967-11-03 52.67397

Row/Column Bind

Sometimes we want to attach rows or columns to an existing dataset; for example, we may want to add a new row about a new musician.

rbind and cbind attach new rows and new columns, respectively:
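A sketch reproducing the matrices below:

```r
A <- matrix(1:20, nrow = 4)  # 4 x 5, filled column by column
B <- A + 100
rbind(A, B)  # stack rows: 8 x 5
cbind(A, B)  # glue columns: 4 x 10
```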

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
## [5,]  101  105  109  113  117
## [6,]  102  106  110  114  118
## [7,]  103  107  111  115  119
## [8,]  104  108  112  116  120
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    5    9   13   17  101  105  109  113   117
## [2,]    2    6   10   14   18  102  106  110  114   118
## [3,]    3    7   11   15   19  103  107  111  115   119
## [4,]    4    8   12   16   20  104  108  112  116   120

Summarizing

In R, there are some predefined functions that work well and fast. You have to know them if you want to learn R. Here they are:

function     description                                    example
apply()      applies a function to rows (1) or columns (2)  apply(data, 2, mean)
lapply()     applies a function to the elements of a list   lapply(list_of_data, mean)
sapply()     same as lapply, but simplifies the output      sapply(list_of_data, mean)
summarize()  applies functions only to columns              summarize(data, average = mean(Age))
##   Year Canada France Germany Italy Japan United.Kingdom United.States
## 1 1950   7291   5186    3881  3172  1921           6939          9561
## 2 1951   7533   5461    4206  3451  2126           7123         10116
## 3 1952   7833   5564    4553  3591  2336           7091         10316
## 4 1953   7984   5684    4905  3830  2474           7346         10613
## 5 1954   7699   5915    5247  3947  2582           7619         10359
## 6 1955   8201   6199    5797  4190  2771           7868         10897

apply

For the average of each country over the whole year range, we need to apply mean to all columns (2):

##           Year         Canada         France        Germany          Italy 
##        1980.00       15685.56       14033.89       13221.28       12147.34 
##          Japan United.Kingdom  United.States 
##       12891.92       14202.72       19479.46

The Year column is not an economic indicator, so we can drop it before applying:

##         Canada         France        Germany          Italy          Japan 
##       15685.56       14033.89       13221.28       12147.34       12891.92 
## United.Kingdom  United.States 
##       14202.72       19479.46

For the average of each year over all countries, we need to apply the mean function to all rows (1):

##  [1]  5421.571  5716.571  5897.714  6119.429  6195.429  6560.429  6776.571
##  [8]  6932.143  6976.429  7283.857  7610.143  7863.000  8188.000  8481.000
## [15]  8911.000  9270.429  9681.286  9969.286 10464.000 10960.286 11311.000
## [22] 11608.286 12083.429 12741.429 12827.143 12772.571 13302.714 13680.714
## [29] 14147.143 14635.429 14702.714 14833.714 14767.571 15070.000 15607.286
## [36] 16067.857 16461.286 16929.571 17603.143 18054.143 18168.714 18219.143
## [43] 18344.286 18368.000 18811.143 19158.714 19450.286 19977.571 20364.857
## [50] 20866.286 21544.857 21763.857 21945.286 22192.286 22635.571 22980.286
## [57] 23437.286 23851.143 23658.286 22575.000 23114.571

Sometimes we may need a more specialized function. For example, if we want to calculate the last 5 years’ average for each country, we could write:

##         Canada         France        Germany          Italy          Japan 
##        24966.2        21764.4        20368.0        19146.2        21884.2 
## United.Kingdom  United.States 
##        24231.0        30930.8

Or you can do it more compactly, in one line, as:

##         Canada         France        Germany          Italy          Japan 
##        24966.2        21764.4        20368.0        19146.2        21884.2 
## United.Kingdom  United.States 
##        24231.0        30930.8
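The chunks in this section might look like this, assuming the Maddison G7 table above is stored in a hypothetical gdp data frame (e.g. gdp <- read.csv("maddison.csv")):

```r
apply(gdp, 2, mean)        # column means; Year sneaks in as an "indicator"
apply(gdp[, -1], 2, mean)  # drop Year first
apply(gdp[, -1], 1, mean)  # row means: yearly averages across countries

last5avg <- function(x) mean(tail(x, 5))  # last 5 years' average
apply(gdp[, -1], 2, last5avg)
apply(gdp[, -1], 2, function(x) mean(tail(x, 5)))  # same thing in one line
```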

lapply/sapply

lapply is the apply function for lists, not data frames. The input can be a vector or a list, and the output is always a list:

## [[1]]
## [1] 15685.56
## 
## [[2]]
## [1] 14033.89
## 
## [[3]]
## [1] 13221.28
## 
## [[4]]
## [1] 12147.34
## 
## [[5]]
## [1] 12891.92
## 
## [[6]]
## [1] 14202.72
## 
## [[7]]
## [1] 19479.46

sapply works like lapply, but tries to simplify the output into a vector or matrix; if it cannot, it returns the same list as lapply:

## [1] 15685.56 14033.89 13221.28 12147.34 12891.92 14202.72 19479.46
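A toy sketch of the difference, using a small hypothetical list instead of the GDP data:

```r
gdp_list <- list(Canada = c(7291, 7533), France = c(5186, 5461))
lapply(gdp_list, mean)  # returns a list
sapply(gdp_list, mean)  # simplifies to a named vector
```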

summarise

The summarise function comes with the tidyverse package. It works similarly to apply, but can be more flexible in some cases:

##    Average SDeviance AvgLast5
## 1 15685.56  5621.451    24954

It is not very handy when you have many columns, but its summarise_all variant handles that case fine:

##     Canada   France  Germany    Italy    Japan United.Kingdom United.States
## 1 15685.56 14033.89 13221.28 12147.34 12891.92       14202.72      19479.46
##   Canada_Average France_Average Germany_Average Italy_Average Japan_Average
## 1       15685.56       14033.89        13221.28      12147.34      12891.92
##   United.Kingdom_Average United.States_Average Canada_SDeviance
## 1               14202.72              19479.46         5621.451
##   France_SDeviance Germany_SDeviance Italy_SDeviance Japan_SDeviance
## 1         5350.765          4977.198        5451.625        6955.059
##   United.Kingdom_SDeviance United.States_SDeviance Canada_Last5 France_Last5
## 1                 5456.816                6965.634      24966.2      21764.4
##   Germany_Last5 Italy_Last5 Japan_Last5 United.Kingdom_Last5
## 1         20368     19146.2     21884.2                24231
##   United.States_Last5
## 1             30930.8

There is also a summarise_at version, which summarizes a given list or range of columns:

##     Canada  Germany    Japan
## 1 15685.56 13221.28 12891.92
##    Germany    Italy    Japan
## 1 13221.28 12147.34 12891.92
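The summarise family calls above might look like this (a sketch; gdp is the assumed name of the Maddison table):

```r
library(dplyr)
summarise(gdp, Average = mean(Canada), SDeviance = sd(Canada),
          AvgLast5 = mean(tail(Canada, 5)))

summarise_all(gdp[, -1], mean)
summarise_all(gdp[, -1],
              list(Average = mean, SDeviance = sd,
                   Last5 = ~ mean(tail(.x, 5))))

summarise_at(gdp, c("Canada", "Germany", "Japan"), mean)  # a list of columns
summarise_at(gdp, vars(Germany:Japan), mean)              # a range of columns
```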

Creating New Variables

We have already seen some ways to create new variables: you simply assign them. But there are other ways that can be handier.

Let’s, for example, create a new column, Decade, which may be handy for grouping:

##   Year Canada France Germany Italy Japan United.Kingdom United.States Decade
## 1 1950   7291   5186    3881  3172  1921           6939          9561   1950
## 2 1951   7533   5461    4206  3451  2126           7123         10116   1950
## 3 1952   7833   5564    4553  3591  2336           7091         10316   1950
## 4 1953   7984   5684    4905  3830  2474           7346         10613   1950
## 5 1954   7699   5915    5247  3947  2582           7619         10359   1950
## 6 1955   8201   6199    5797  4190  2771           7868         10897   1950

or you can use mutate from the tidyverse package:

##   Year Canada France Germany Italy Japan United.Kingdom United.States Decade
## 1 1950   7291   5186    3881  3172  1921           6939          9561   1950
## 2 1951   7533   5461    4206  3451  2126           7123         10116   1950
## 3 1952   7833   5564    4553  3591  2336           7091         10316   1950
## 4 1953   7984   5684    4905  3830  2474           7346         10613   1950
## 5 1954   7699   5915    5247  3947  2582           7619         10359   1950
## 6 1955   8201   6199    5797  4190  2771           7868         10897   1950
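The two chunks above might look like this (assuming the gdp data frame from earlier); flooring the year to a multiple of ten gives the decade:

```r
gdp$Decade <- floor(gdp$Year / 10) * 10             # base R assignment
gdp <- mutate(gdp, Decade = floor(Year / 10) * 10)  # tidyverse equivalent
head(gdp)
```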

Grouping

In previous weeks, we have seen how to group the data by some variables. It is worth mentioning here too:

## # A tibble: 7 x 3
##   Decade CanadaAvg  USAvg
##    <dbl>     <dbl>  <dbl>
## 1   1950     8101  10556.
## 2   1960    10232. 13158.
## 3   1970    14202. 16745.
## 4   1980    17323. 20410 
## 5   1990    19324. 24801.
## 6   2000    24006. 30107.
## 7   2010    24941  30491

or we can do it to all countries as:

## # A tibble: 7 x 8
##   Decade Canada France Germany  Italy  Japan United.Kingdom United.States
##    <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>          <dbl>         <dbl>
## 1   1950  8101   6105.   5517.  4109.  2714.          7614.        10556.
## 2   1960 10232.  9013.   8936.  7024.  6044.          9573.        13158.
## 3   1970 14202. 13067.  12267. 10785. 11389.         11923.        16745.
## 4   1980 17323. 15698.  15044. 14062  15273.         14258.        20410 
## 5   1990 19324. 18322.  17198  17103. 19759.         17703.        24801.
## 6   2000 24006. 21254.  19621. 19164. 21269.         23188.        30107.
## 7   2010 24941  21477   20661  18520  21935          23777         30491
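The two grouped summaries above could be produced like this (a sketch; gdp is assumed to carry the Decade column created earlier):

```r
library(dplyr)
gdp %>%
  group_by(Decade) %>%
  summarise(CanadaAvg = mean(Canada), USAvg = mean(United.States))

gdp %>%
  group_by(Decade) %>%
  summarise_at(vars(Canada:United.States), mean)
```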

Gather and Spread

There is a life-saver function, gather, and its dual, spread, in the tidyverse package. May tidyverse be with you.

Let’s see our problem first. Assume you want to plot the time series of the economies and compare them. To compare them you must add a separate line for each country, because each country is a separate variable:

I don’t have time to do it for all.

Instead, you can transform your data using gather as below:

##   Year Country GDPPC
## 1 1950  Canada  7291
## 2 1951  Canada  7533
## 3 1952  Canada  7833
## 4 1953  Canada  7984
## 5 1954  Canada  7699
## 6 1955  Canada  8201
##     Year       Country GDPPC
## 422 2005 United.States 30842
## 423 2006 United.States 31358
## 424 2007 United.States 31655
## 425 2008 United.States 31251
## 426 2009 United.States 29899
## 427 2010 United.States 30491

Now we have handier data: each row contains the GDP per capita value of one year and country combination. It is way easier to plot:

In case you need wider data, you can use spread:

##   Year Canada France Germany Italy Japan United.Kingdom United.States
## 1 1950   7291   5186    3881  3172  1921           6939          9561
## 2 1951   7533   5461    4206  3451  2126           7123         10116
## 3 1952   7833   5564    4553  3591  2336           7091         10316
## 4 1953   7984   5684    4905  3830  2474           7346         10613
## 5 1954   7699   5915    5247  3947  2582           7619         10359
## 6 1955   8201   6199    5797  4190  2771           7868         10897
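The reshaping above might look like this (a sketch, assuming the wide gdp table from the previous sections):

```r
library(tidyr)
gdp_long <- gather(gdp, key = "Country", value = "GDPPC", -Year)  # wide -> long
head(gdp_long)
tail(gdp_long)

gdp_wide <- spread(gdp_long, key = Country, value = GDPPC)        # long -> wide
head(gdp_wide)
```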

Using pipes

Pipes are the way to write elegant and more readable code. Without them, you write some code, assign the output to a new object, then write more code using that object, and so on.

Instead of assigning to intermediate objects, we can pipe the output into the next function, and assign the result of the whole pipe if we want:
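A sketch of such a pipe, combining the steps from the previous sections (assuming the wide gdp table; the exact row order of the output may differ):

```r
library(dplyr); library(tidyr)
gdp %>%
  gather(key = "Country", value = "GDPPC", -Year) %>%
  mutate(Decade = floor(Year / 10) * 10) %>%
  group_by(Decade, Country) %>%
  summarise(GDPPC = mean(GDPPC)) %>%
  head()
```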

## # A tibble: 6 x 3
##   Decade Country  GDPPC
##    <dbl> <chr>    <dbl>
## 1   1950 Canada   8101 
## 2   1960 Canada  10232.
## 3   1970 Canada  14202.
## 4   1980 Canada  17323.
## 5   1990 Canada  19324.
## 6   2000 Canada  24006.

or you can plot it directly:

Preprocessing Covid Confirmed Data

Sometimes life doesn’t give us what we want; we want to cruise a yacht on Lake Ontario and have a barbecue with good drinks. In such cases you have to work harder.

For example, in the Covid Confirmed Cases dataset by Johns Hopkins University, the dates are painfully and arbitrarily entered. Some entries are in the form 2020-03-17T18:33:03, some are 2020-03-17 18:33:03 and some are 03/17/2020:

##   X Province        Country            Date Confirmed Deaths Recovered
## 1 1          Mainland China 1/22/2020 17:00         1     NA        NA
## 2 2          Mainland China 1/22/2020 17:00        14     NA        NA
## 3 3          Mainland China 1/22/2020 17:00         6     NA        NA
## 4 4          Mainland China 1/22/2020 17:00         1     NA        NA
## 5 5          Mainland China 1/22/2020 17:00        NA     NA        NA
## 6 6          Mainland China 1/22/2020 17:00        26     NA        NA
##             X Province            Country                Date Confirmed Deaths
## 277131 277131                     Vietnam 2020-06-13 03:33:14       333      0
## 277132 277132          West Bank and Gaza 2020-06-13 03:33:14       489      3
## 277133 277133              Western Sahara 2020-06-13 03:33:14         9      1
## 277134 277134                       Yemen 2020-06-13 03:33:14       632    139
## 277135 277135                      Zambia 2020-06-13 03:33:14      1321     10
## 277136 277136                    Zimbabwe 2020-06-13 03:33:14       343      4
##        Recovered
## 277131       323
## 277132       414
## 277133         6
## 277134        28
## 277135      1104
## 277136        51

We have to write a few lines of code before working with this pain:
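One way to tame the mixed formats is lubridate’s parse_date_time, which accepts several candidate orders (a sketch; the exact cleaning code used here was not shown):

```r
library(lubridate)
covid <- read.csv("covid_confirmed.csv", stringsAsFactors = FALSE)
head(covid)
covid[561:566, ]  # rows mixing ISO "T" datetimes with m/d/Y entries

covid$Date <- parse_date_time(covid$Date,
                              orders = c("mdy HM", "ymd HMS"))
head(covid)
tail(covid)
```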

##   X Province        Country            Date Confirmed Deaths Recovered
## 1 1          Mainland China 1/22/2020 17:00         1     NA        NA
## 2 2          Mainland China 1/22/2020 17:00        14     NA        NA
## 3 3          Mainland China 1/22/2020 17:00         6     NA        NA
## 4 4          Mainland China 1/22/2020 17:00         1     NA        NA
## 5 5          Mainland China 1/22/2020 17:00        NA     NA        NA
## 6 6          Mainland China 1/22/2020 17:00        26     NA        NA
##       X Province        Country                Date Confirmed Deaths Recovered
## 561 561          Mainland China 2020-02-02T23:43:02     11177    350       295
## 562 562          Mainland China 2020-02-02T18:03:05       661      0        32
## 563 563          Mainland China 2020-02-02T18:03:05       632      0        15
## 564 564          Mainland China 2020-02-02T18:03:05       493      2        10
## 565 565          Mainland China 2020-02-02T18:03:05       463      0        16
## 566 566          Mainland China 2020-02-02T18:03:05       340      0         7
##   X Province        Country                Date Confirmed Deaths Recovered
## 1 1          Mainland China 2020-01-22 17:00:00         1     NA        NA
## 2 2          Mainland China 2020-01-22 17:00:00        14     NA        NA
## 3 3          Mainland China 2020-01-22 17:00:00         6     NA        NA
## 4 4          Mainland China 2020-01-22 17:00:00         1     NA        NA
## 5 5          Mainland China 2020-01-22 17:00:00        NA     NA        NA
## 6 6          Mainland China 2020-01-22 17:00:00        26     NA        NA
##             X Province            Country                Date Confirmed Deaths
## 277131 277131                     Vietnam 2020-06-13 03:33:14       333      0
## 277132 277132          West Bank and Gaza 2020-06-13 03:33:14       489      3
## 277133 277133              Western Sahara 2020-06-13 03:33:14         9      1
## 277134 277134                       Yemen 2020-06-13 03:33:14       632    139
## 277135 277135                      Zambia 2020-06-13 03:33:14      1321     10
## 277136 277136                    Zimbabwe 2020-06-13 03:33:14       343      4
##        Recovered
## 277131       323
## 277132       414
## 277133         6
## 277134        28
## 277135      1104
## 277136        51

The Confirmed column holds the cumulative confirmed cases. For example, the subset for Turkey shows that the numbers are increasing:

##             X Province Country                Date Confirmed Deaths Recovered
## 258600 258600           Turkey 2020-06-08 03:33:22    170132   4692    137969
## 262272 262272           Turkey 2020-06-09 03:33:03    171121   4711    141380
## 265946 265946           Turkey 2020-06-10 04:07:00    172114   4729    144598
## 269663 269663           Turkey 2020-06-11 03:33:41    173036   4746    146839
## 273387 273387           Turkey 2020-06-12 05:09:52    174023   4763    147860
## 277125 277125           Turkey 2020-06-13 03:33:14    175218   4778    149102

However, countries that have Province information report province-wise data, which is also cumulative:

##             X Province Country                Date Confirmed Deaths Recovered
## 258273 258273  Ontario  Canada 2020-06-08 03:33:22     32096   2502     24694
## 261946 261946  Ontario  Canada 2020-06-09 03:33:03     32395   2524     25002
## 265620 265620  Ontario  Canada 2020-06-10 04:07:00     32678   2536     25375
## 269326 269326  Ontario  Canada 2020-06-11 03:33:41     32936   2552     25956
## 273047 273047  Ontario  Canada 2020-06-12 05:09:52     33173   2563     26358
## 276785 276785  Ontario  Canada 2020-06-13 03:33:14     33378   2573     26698

We are more interested in the country-wise numbers, so we will calculate daily sums of the Confirmed, Deaths and Recovered columns for each country. For this, we need to:

  • convert the Datetimes to Dates (cut the time)
  • drop the Province column
  • group by Country and Date
  • summarize by calculating the sums
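A sketch implementing these steps (covid is the assumed name of the cleaned data frame, covid_country of the result):

```r
library(dplyr)
covid_country <- covid %>%
  mutate(Date = as.Date(Date)) %>%  # Datetime -> Date (cut the time)
  select(-X, -Province) %>%         # drop the index and Province columns
  group_by(Date, Country) %>%       # one group per country-day
  summarise_all(sum)                # sum Confirmed, Deaths, Recovered
head(covid_country)
```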
## # A tibble: 6 x 5
## # Groups:   Date [1]
##   Date       Country        Confirmed Deaths Recovered
##   <date>     <fct>              <int>  <int>     <int>
## 1 2020-01-22 Hong Kong             NA     NA        NA
## 2 2020-01-22 Japan                  2     NA        NA
## 3 2020-01-22 Macau                  1     NA        NA
## 4 2020-01-22 Mainland China        NA     NA        NA
## 5 2020-01-22 South Korea            1     NA        NA
## 6 2020-01-22 Taiwan                 1     NA        NA

You can see that some countries have NA, because they have NA entries for some regions and numbers for others. In R, NA propagates through sums: e.g. NA + 5 = NA. We can use the na.rm = T option:

## # A tibble: 6 x 5
## # Groups:   Date [1]
##   Date       Country        Confirmed Deaths Recovered
##   <date>     <fct>              <int>  <int>     <int>
## 1 2020-01-22 Hong Kong              0      0         0
## 2 2020-01-22 Japan                  2      0         0
## 3 2020-01-22 Macau                  1      0         0
## 4 2020-01-22 Mainland China       547     17        28
## 5 2020-01-22 South Korea            1      0         0
## 6 2020-01-22 Taiwan                 1      0         0

What about the G7 countries?

Kool Stuff

Is it slowing down?

A YouTube video (watched, but cannot find now) suggests another approach to understanding whether the spread is slowing down. The basic idea is as follows. Population growth, whether of humans, bacteria or anything else, can roughly be modelled by the differential equation below:

\[\frac{\mathrm{d}P}{\mathrm{d}t} = kP\]

or if we rearrange: \[\frac{\mathrm{d}P}{P} = k\,\mathrm{d}t\]

The left side of the equation is the growth benchmarked to the current population, and the right side is a constant times the time increment. Interestingly, if we take the integral of both sides, we arrive at \(\ln P = kt + C\).

Enough calculus. We will plot \(\ln P\) on the x axis and \(\ln \mathrm{d}P\) on the y axis. If the shape is pretty much linear, the spread is not slowing down; points falling below the linear trend indicate that it is:
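A sketch of such a plot; world is a hypothetical data frame of country-level daily totals with a cumulative Confirmed column:

```r
library(dplyr); library(ggplot2)
world %>%
  group_by(Country) %>%
  arrange(Date) %>%
  mutate(NewCases = Confirmed - lag(Confirmed)) %>%  # dP: daily new cases
  ggplot(aes(x = log(Confirmed), y = log(NewCases), colour = Country)) +
  geom_line()
```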

It looks like we are getting better!

Deliverables

4. Summarizing Data

You will use pipes to create an aggregated dataset; i.e. by using something like select(...) %>% mutate(...) %>% .... See the “Using Pipes” section in Lab-7. There is another usage example in “Preprocessing Covid Confirmed Data”.

  • Pick a categorical variable. If there are too many distinct values in the variable (e.g. more than 7 plus or minus 2), subset the data based on the values you are interested in.

Your pipe structure will consist of the following steps:

  1. Select at least three continuous variables (using the select function)
  2. Convert the date in your data to date object using as.Date or functions in lubridate package. You can use mutate function to overwrite the date.
  3. Add another mutate to create YearMonth or YearQuarter columns (the ones using the Nuclear Explosions data can find the long YearMonthDay in the first column). Year-month can be created by extracting the year and month and pasting them: paste0(year(...),"-", month(...))
  4. group_by the data based on YearMonth or YearQuarter you created.
  5. summarise the data by taking the average or sum using the output of the previous step. If average or sum doesn’t make sense you’ll lose mark. E.g. sum of covid infection rate per month doesn’t make sense but the average makes sense.

Assign the output to a value and present head(...) and tail(...).

5. Plot the Time Series

Plot the time series graphs of the previous data:

  • Choose one continuous variable (which you aggregated using sum or mean.)
  • Plot the time series of that data, but map the categorical variable to the colour, so that you will have separate lines in different colours.
  • (optional) Remove the legend and add the geom_text to indicate which line is what. (see the plot above “Kool Stuff” title in Lab-7).